{
 "cells": [
  {
   "cell_type": "markdown",
   "source": [
    "# Partition based time series clustering in sktime\n",
    "\n",
    "Partition based clustering algorithms for time series are those where k\n",
    "clusters are created from n time series. Each time series inside a cluster\n",
    "are homogenous (similar) to each other and heterogeneous (dissimilar) to\n",
    "those outside the cluster.\n",
    "\n",
    "To measure homogeneity and place a time series into a cluster, a metrics is\n",
    "used (normally a distance). This metric is used to compare a time\n",
    "series to k other time series that are representations of each cluster.\n",
    "These representations are derived using a statistical sampling technique\n",
    "from when the model was trained. The time series is then assigned to the\n",
    "cluster that is most similar (or has the best score based on the metric),\n",
    "and is said to be a part of that cluster.\n",
    "\n",
    "Currently, in Sktime there are 2 supported partition algorithms supported.\n",
    "They are: K-means and K-medoids. These will be demonstrated below\n",
    "\n",
    "## Imports"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "from sktime.clustering._k_means import TimeSeriesKMeans\n",
    "from sktime.clustering._k_medoids import TimeSeriesKMedoids\n",
    "from sktime.clustering.evaluation._plot_clustering import plot_cluster_algorithm\n",
    "from sktime.clustering.tests._clustering_tests import generate_univaritate_series\n",
    "from sktime.datasets import load_arrow_head"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## Load data"
   ],
   "metadata": {
    "collapsed": false
   }
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(158, 1) (158,) (53, 1) (53,)\n"
     ]
    }
   ],
   "source": [
    "X, y = load_arrow_head(return_X_y=True)\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y)\n",
    "print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "## K-partition algorithms\n",
    "\n",
    "K-partition algorithms are those that seek to create k clusters and assign\n",
    "each time series to a cluster using a distance measure. In Sktime the\n",
    "following are currently supported:\n",
    "<li>K-means - one of the most well known clustering algorithms. The goal of\n",
    "the K-means algorithm is to minimise a criterion known as the inertia or\n",
    "within-cluster sum-of-squares (uses the mean of the values in a cluster as\n",
    "the center).</li>\n",
    "<li>K-medoids - A similar algorithm to K-means but instead of using the mean of\n",
    "each cluster to determine the center, the median series is used.\n",
    "<br><br>\n",
    "Algorithmically the common way to approach these k-partition algorithms is\n",
    "known as Lloyds algorithm. This is an iterative process that involves:\n",
    "<li>Initialisation</li>\n",
    "<li>Assignment</li>\n",
    "<li>Updating centroid</li>\n",
    "<li>Repeat assigment and updating until convergence</li>\n",
    "<br>\n",
    "\n",
    "### Center initialisation\n",
    "Center initialisation is the first step and has been found to be critical in\n",
    "obtaining good results. There are three main center initialisation that will\n",
    "now be outlined:\n",
    "<ul>\n",
    "    <li> K-means, K-medoids supported center initialisation:\n",
    "        <ul>\n",
    "            <li>Random - This is where each sample in the training dataset is\n",
    "            randomly assigned to a cluster and update centers is run based on\n",
    "            these random assignments i.e. for k-means the mean of these\n",
    "            randomly assigned clusters is used and for k-medoids the median of\n",
    "            these randomly assigned clusters is used as the initial center.</li>\n",
    "            <li>Forgy - This is where k random samples are chosen from the\n",
    "            training dataset</li>\n",
    "            <li>K-means++ - This is where the first center is randomly chosen\n",
    "            from the training dataset. Then for all the other time series in\n",
    "            the training dataset, the distance between them and the randomly\n",
    "            selected center is taken. The next center is then chosen from the\n",
    "            training dataset so that the probability of being chosen is\n",
    "            proportional to the distance (i.e. the greater the distance the\n",
    "            more likely the time series will be chosen). The remaining\n",
    "            centers are then generated using the same method. Thie means each\n",
    "            center will be generated so that the probability of a being\n",
    "            chosen becomes proportional to its closest center (i.e. the further\n",
    "            from an already chosen center the more likely it will be chosen).\n",
    "            </li>\n",
    "         </ul>\n",
    "     </li>\n",
    " </ul>\n",
    " <br>\n",
    " These three cluster initialisation algorithms have been implemented and can\n",
    " be chosen to use when constructing either k-means or k-medoids partitioning\n",
    " algorithms by parsing the string values 'random' for random iniitialisation,\n",
    " 'forgy' for forgy and 'k-means++' for k-means++.\n",
    "\n",
    "### Assignment (distance measure)\n",
    "How a time series is assigned to a cluster is based on its distance from it to\n",
    "the cluster center. This means for some k-partition approaches, different\n",
    "distances can be used to get different results. While details on each of the\n",
    "distances won't be discussed here the following are supported, and their\n",
    "parameter values are defined:\n",
    "<ul>\n",
    "    <li>K-means, K-medoids supported distances:\n",
    "        <ul>\n",
    "            <li>Euclidean - parameter string 'euclidean'</li>\n",
    "            <li>DTW - parameter string 'dtw'</li>\n",
    "            <li>DDTW - parameter string 'ddtw'</li>\n",
    "            <li>WDTW - parameter string 'wdtw'</li>\n",
    "            <li>WDDTW - parameter string 'wddtw'</li>\n",
    "            <li>LCSS - parameter string 'lcss'</li>\n",
    "            <li>ERP - parameter string 'erp'</li>\n",
    "            <li>MSG - parameter string 'msm'</li>\n",
    "            <li>TWE - parameter string 'twe'</li>\n",
    "        </ul>\n",
    "    </li>\n",
    "</ul>\n",
    "\n",
    "### Updating centroids\n",
    "Center updating is key to the refinement of clusters to improve their quality.\n",
    "How this is done depends on the algorithm, but the following are supported\n",
    "<ul>\n",
    "    <li> K-means\n",
    "        <ul>\n",
    "            <li>Mean average - This is a standard mean average creating a new\n",
    "             series that is that average of all the series inside the cluster.\n",
    "             Ideal when using euclidean distance. Can be specified to use\n",
    "             by parsing 'means' as the parameter for averaging_algorithm to\n",
    "             k-means</li>\n",
    "            <li>DTW Barycenter averaging (DBA) - This is a specialised\n",
    "            averaging metric that is intended to be used with the dtw\n",
    "            distance measure as it used the dtw matrix path to account for\n",
    "            allignment when determining an average series. This provides a\n",
    "            much better representation of the mean for time series. Can be\n",
    "            specified to use by parsing 'dba' as the parameter for\n",
    "            averaging_algorithm to k-means.</li>\n",
    "        </ul>\n",
    "    </li>\n",
    "    <li> K-medoids\n",
    "        <ul>\n",
    "            <li>Median - Standard meadian which is the series in the middle of\n",
    "            all series in a cluster. Used by default by k-medoids</li>\n",
    "        </ul>\n",
    "   </li>\n",
    "   </li>\n",
    "</ul>\n",
    "\n",
    "## K-means"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "k_means = TimeSeriesKMeans(\n",
    "    n_clusters=5,  # Number of desired centers\n",
    "    init_algorithm=\"forgy\",  # Center initialisation technique\n",
    "    max_iter=200,  # Maximum number of iterations for refinement on training set\n",
    "    verbose=False,  # Verbose\n",
    "    metric=\"euclidean\",  # Distance metric to use\n",
    "    averaging_algorithm=\"mean\",  # Averaging technique to use\n",
    ")\n",
    "\n",
    "k_means.fit(X_train)\n",
    "plot_cluster_algorithm(k_means, X_test, k_means.n_clusters)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n",
     "is_executing": true
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The above shows the basic usage for K-means followed by graphs showing the\n",
    "clusters. The cluster centers are the red line in the graph, and the blue\n",
    "lines are the time series that belong to the cluster.\n",
    "<br><br>\n",
    "While it would appear to have performed decently well we can improve this by\n",
    "taking advantage of time series specific distance measures and averaging\n",
    "metrics. This will be demonstrated below using the dynamic time warping\n",
    "distance metric in combination with dtw barycenter averaging to calculate the\n",
    "centers."
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "# Note this is incredibly unoptimised at the moment and will take a while to\n",
    "# run so using dummy data so the notebook doesn't time out\n",
    "\n",
    "\n",
    "rng = np.random.RandomState(1)\n",
    "dba_X_train = generate_univaritate_series(n=100, size=20, rng=rng, dtype=np.double)\n",
    "dba_X_test = generate_univaritate_series(\n",
    "    n=50, size=20, rng=np.random.RandomState(2), dtype=np.double\n",
    ")\n",
    "\n",
    "k_means = TimeSeriesKMeans(\n",
    "    n_clusters=5,  # Number of desired centers\n",
    "    init_algorithm=\"forgy\",  # Center initialisation technique\n",
    "    max_iter=20,  # Maximum number of iterations for refinement\n",
    "    verbose=False,  # Verbose\n",
    "    metric=\"dtw\",  # Distance metric to use\n",
    "    averaging_algorithm=\"dba\",  # Averaging technique to use\n",
    "    averaging_algorithm_iterations=3,  # Number of iterations DBA is refined over\n",
    ")\n",
    "\n",
    "k_means.fit(dba_X_train)\n",
    "plot_cluster_algorithm(k_means, dba_X_test, k_means.n_clusters)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n",
     "is_executing": true
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "When using dba an additional parameter can be parsed that is called\n",
    "'averaging_algorithm_iterations' which is the number of iterations DBA is\n",
    "refined over. This does not need to be a large value as DBA converges very\n",
    "quickly.\n",
    "\n",
    "## K-medoids"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "k_medoids = TimeSeriesKMedoids(\n",
    "    n_clusters=5,  # Number of desired centers\n",
    "    init_algorithm=\"forgy\",  # Center initialisation technique\n",
    "    max_iter=200,  # Maximum number of iterations for refinement on training set\n",
    "    verbose=False,  # Verbose\n",
    "    metric=\"euclidean\",  # Distance metric to use\n",
    ")\n",
    "\n",
    "k_medoids.fit(X_train)\n",
    "plot_cluster_algorithm(k_medoids, X_test, k_medoids.n_clusters)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n",
     "is_executing": true
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "The above shows the basic usage for K-medoids. This algorithm\n",
    "works essentially the same as K-means but instead of updating\n",
    "the center by taking the mean of the series inside the cluster,\n",
    "it takes the median series. This means the center is an existing\n",
    "series in the dataset and NOT a generated one.\n",
    "<br><br>\n",
    "The parameter key to k-medoids is the metric and is what we\n",
    "can adjust to improve performance for time series. An example\n",
    "of using dtw as the metric is given below.\n"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "outputs": [],
   "source": [
    "k_medoids = TimeSeriesKMedoids(\n",
    "    n_clusters=5,  # Number of desired centers\n",
    "    init_algorithm=\"forgy\",  # Center initialisation technique\n",
    "    max_iter=200,  # Maximum number of iterations for refinement on training set\n",
    "    verbose=False,  # Verbose\n",
    "    metric=\"dtw\",  # Distance metric to use\n",
    ")\n",
    "\n",
    "k_medoids.fit(X_train)\n",
    "plot_cluster_algorithm(k_medoids, X_test, k_medoids.n_clusters)"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n",
     "is_executing": true
    }
   }
  },
  {
   "cell_type": "markdown",
   "source": [
    "As can be seen again the centers are more refined using a dtw\n",
    "distance metric.\n",
    "\n",
    "## Custom initializer algorithm\n",
    "Todo\n",
    "\n",
    "## Custom metric\n",
    "Todo\n",
    "\n",
    "## Custom averaging algorithm\n",
    "Todo"
   ],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%% md\n"
    }
   }
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "outputs": [],
   "source": [],
   "metadata": {
    "collapsed": false,
    "pycharm": {
     "name": "#%%\n"
    }
   }
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
